{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ensemble Predictions From Multiple Models\n",
    "_**Combining a Linear-Learner with XGBoost for  superior predictive peformance**_\n",
    "\n",
    "---\n",
    "\n",
    "---\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Prepration](#Preparation)\n",
    "1. [Data](#Data)\n",
    "    1. [Exploration and Transformation](#Exploration) \n",
    "1. [Training Xgboost model using SageMaker](#Training)\n",
    "1. [Hosting the model](#Hosting)\n",
    "1. [Evaluating the model on test samples](#Evaluation)\n",
    "1. [Training a second Logistic Regression model using SageMaker](#Linear-Model)\n",
    "1. [Hosting the Second model](#Hosting:Linear-Learner)\n",
    "1. [Evaluating the model on test samples](#Prediction:Linear-Learner)\n",
    "1. [Combining the model results](#Ensemble)\n",
    "1. [Evaluating the combined model on test samples](#Evaluate-Ensemble)\n",
    "1. [Exentsions](#Extensions)\n",
    "\n",
    "---\n",
    "\n",
    "## Background\n",
    "Quite often, in pratical applications of Machine-Learning on predictive tasks, one model doesn't suffice. Most of the prediction competitions typically require combining forecasts from multiple sources to get an improved forecast. By combining or averaging predictions from multiple sources/models we typically get an improved forecast. This happens as there is considerable uncertainty in the choice of the model and there is no one true model in many practical applications. It is therefore beneficial to combine predictions from different models. In the Bayesian literature, this idea is referred as Bayesian Model Averaging http://www.stat.colostate.edu/~jah/papers/statsci.pdf and has been shown to work much better than just picking one model.\n",
    "\n",
    "\n",
    "This notebook presents an illustrative example to predict if a person makes over 50K a year based on information about their education, work-experience, geneder etc.\n",
    "\n",
    "* Preparing your _SageMaker_ notebook\n",
    "* Loading a dataset from S3 using SageMaker\n",
    "* Investigating and transforming the data so that it can be fed to _SageMaker_ algorithms\n",
    "* Estimating a model using SageMaker's XGBoost (eXtreme Gradient Boosting) algorithm\n",
    "* Hosting the model on SageMaker to make on-going predictions\n",
    "* Estimating a second model using SageMaker's Linear Learner method\n",
    "* Combining the predictions from both the models and evluating the combined prediction\n",
    "* Generating final predictions on the test data set\n",
    "\n",
    "\n",
    "---\n",
    "\n",
    "## Setup\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "* The SageMaker role arn used to give learning and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto call with a the appropriate full SageMaker role arn string.\n",
    "* The S3 bucket that you want to use for training and storing model objects.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import boto3\n",
    "import time\n",
    "import re\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()\n",
    "\n",
    "# Now let's define the S3 bucket we'll used for the remainder of this example.\n",
    "\n",
    "bucket = '<your_s3_bucket_name_here>' #  enter your s3 bucket where you will copy data and model artificats\n",
    "prefix = 'sagemaker/DEMO-xgboost'  # place to upload training files within the bucket"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's bring in the Python libraries that we'll use throughout the analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np                                # For matrix operations and numerical processing\n",
    "import pandas as pd                               # For munging tabular data\n",
    "import sklearn as sk                              # For access to a variety of machine learning models\n",
    "import matplotlib.pyplot as plt                   # For charts and visualizations\n",
    "from IPython.display import Image                 # For displaying images in the notebook\n",
    "from IPython.display import display               # For displaying outputs in the notebook\n",
    "from sklearn.datasets import dump_svmlight_file   # For outputting data to libsvm format for xgboost\n",
    "from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.\n",
    "import sys                                        # For writing outputs to notebook\n",
    "import math                                       # For ceiling function\n",
    "import json                                       # For parsing hosting output\n",
    "import io                                         # For working with stream data\n",
    "import sagemaker.amazon.common as smac            # For protobuf data format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Data \n",
    "Let's start by downloading publicly available Census Income dataset avaiable at https://archive.ics.uci.edu/ml/datasets/Adult. In this dataset we have different attributes such as age, workclass, education, country, race etc for each person. We also have an indicator of person's income being more than $50K an year. The prediction task is to determine whether a person makes over 50K a year. \n",
    "\n",
    "* Data comes in two separate files: adult.data and adult.test\n",
    "* The field names as well as additional information is available in the file adult.names\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets read this into a Pandas data frame and take a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## read the data\n",
    "data = pd.read_csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\", header = None)\n",
    "\n",
    "## read test data\n",
    "data_test = pd.read_csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\", header = None, skiprows=1)\n",
    "\n",
    "## set column names\n",
    "data.columns = ['age', 'workclass','fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', \n",
    "'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'IncomeGroup']\n",
    "\n",
    "data_test.columns = ['age', 'workclass','fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', \n",
    "'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'IncomeGroup']\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Exploration\n",
    "### Data exploration and transformations\n",
    "\n",
    "In what follows we will do a basic exploration of the dataset to understand the size of data, various fields it has, the values different features take, distribution of target values etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# set display options\n",
    "pd.set_option('display.max_columns', 100)     # Make sure we can see all of the columns\n",
    "pd.set_option('display.max_rows', 6)         # Keep the output on one page\n",
    "\n",
    "# disply data\n",
    "display(data)\n",
    "display(data_test)\n",
    "\n",
    "# display positive and negative counts\n",
    "display(data.iloc[:,14].value_counts())\n",
    "display(data_test.iloc[:,14].value_counts())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Combine the two datasets to convert the categorical values to binary indicators\n",
    "data_combined = pd.concat([data, data_test])\n",
    "\n",
    "## convert the categorical variables to binary indicators\n",
    "data_combined_bin = pd.get_dummies(data_combined, prefix=['workclass', 'education', 'marital-status', 'occupation', 'relationship', \n",
    "                                                          'race', 'sex', 'native-country', 'IncomeGroup'], drop_first=True)\n",
    "\n",
    "# combine the income >50k indicators\n",
    "Income_50k = ((data_combined_bin.iloc[:,101]==1) | (data_combined_bin.iloc[:,102] ==1))+0;\n",
    "\n",
    "# make the income indicator as first column\n",
    "data_combined_bin = pd.concat([Income_50k, data_combined_bin.iloc[:, 0:100]], axis=1)\n",
    "\n",
    "# Post conversion to binary split the data sets separately\n",
    "data_bin = data_combined_bin.iloc[0:data.shape[0], :]\n",
    "data_test_bin = data_combined_bin.iloc[data.shape[0]:, :]\n",
    "\n",
    "# display the data sets post conversion to binary indicators\n",
    "display(data_bin)\n",
    "display(data_test_bin)\n",
    "\n",
    "# count number of positives and negatives\n",
    "display(data_bin.iloc[:,0].value_counts())\n",
    "display(data_test_bin.iloc[:,0].value_counts())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Description\n",
    "Let's talk about the data.  At a high level, we can see:\n",
    "\n",
    "* There are 15 columns and around 32K rows in the training data\n",
    "* There are 15 columns and around 16 K rows in the test data\n",
    "* IncomeGroup is the target field\n",
    "\n",
    "_**Specifics on the features:**_ \n",
    "* 9 of the 14 features are categorical and remaining 5 are numeric\n",
    "* When we convert the catgorical features to binary we find there are altogether 103-1 =102 features\n",
    "\n",
    "**Target variable:**\n",
    "* `IncomeGroup_>50K`: Whether or not annual income was more than 50K"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Training\n",
    "\n",
    "As our first training algorithm we pick `xgboost` algorithm.  `xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using `SageMaker's` managed, distributed training framework.\n",
    "\n",
    "First we'll need to specify training parameters.  This includes:\n",
    "1. The role to use\n",
    "1. Our training job name\n",
    "1. The `xgboost` algorithm container\n",
    "1. Training instance type and count\n",
    "1. S3 location for training data\n",
    "1. S3 location for output data\n",
    "1. Algorithm hyperparameters\n",
    "1. Stopping conditions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Supported Training Input Format: csv, libsvm.\n",
    "For csv input, right now we assume the input is separated by delimiter(automatically detect the separator by Python’s builtin sniffer tool), without a header line and also label is in the first column.\n",
    "Scoring Output Format: csv.\n",
    "\n",
    "* Since our data is in CSV format, we will convert our dataset to the way SageMaker's XGboost supports.\n",
    "* We will keep the target field in first column and remaining features in the next few columns\n",
    "* We will remove the header line\n",
    "* We will also split the data into a separate training and validation sets\n",
    "* Store the data into our s3 bucket\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the data into 80% training and 20% validation and save it before calling XGboost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Split the data randomly as 80% for training and remaining 20% and save them locally\n",
    "train_list = np.random.rand(len(data_bin)) < 0.8\n",
    "data_train = data_bin[train_list]\n",
    "data_val = data_bin[~train_list]\n",
    "data_train.to_csv(\"formatted_train.csv\", sep=',', header=False, index=False) # save training data \n",
    "data_val.to_csv(\"formatted_val.csv\", sep=',', header=False, index=False) # save validation data\n",
    "data_test_bin.to_csv(\"formatted_test.csv\", sep=',', header=False,  index=False) # save test data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Upload training and validation data sets in the s3 bucket and prefix provided"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "\n",
    "train_file = 'formatted_train.csv'\n",
    "val_file = 'formatted_val.csv'\n",
    "\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/', train_file)).upload_file(train_file)\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'val/', val_file)).upload_file(val_file)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Specify images used for training and hosting SageMaker's Xgboost algorithm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'xgboost')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "from time import gmtime, strftime\n",
    "\n",
    "job_name = 'DEMO-xgboost-single-censusincome-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(\"Training job\", job_name)\n",
    "\n",
    "create_training_params = \\\n",
    "{\n",
    "    \"AlgorithmSpecification\": {\n",
    "        \"TrainingImage\": container,\n",
    "        \"TrainingInputMode\": \"File\"\n",
    "    },\n",
    "    \"RoleArn\": role,\n",
    "    \"OutputDataConfig\": {\n",
    "        \"S3OutputPath\": \"s3://{}/{}/single-xgboost/\".format(bucket, prefix),\n",
    "    },\n",
    "    \"ResourceConfig\": {\n",
    "        \"InstanceCount\": 1,\n",
    "        \"InstanceType\": \"ml.m4.4xlarge\",\n",
    "        \"VolumeSizeInGB\": 20\n",
    "    },\n",
    "    \"TrainingJobName\": job_name,\n",
    "    \"HyperParameters\": {\n",
    "        \"max_depth\":\"5\",\n",
    "        \"eta\":\"0.1\",\n",
    "        \"gamma\":\"1\",\n",
    "        \"min_child_weight\":\"1\",\n",
    "        \"silent\":\"0\",\n",
    "        \"objective\": \"binary:logistic\",\n",
    "        \"eval_metric\": \"auc\",\n",
    "        \"num_round\": \"20\"\n",
    "    },\n",
    "    \"StoppingCondition\": {\n",
    "        \"MaxRuntimeInSeconds\": 60 * 60\n",
    "    },\n",
    "    \"InputDataConfig\": [\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\":  \"s3://{}/{}/train/\".format(bucket, prefix),\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"csv\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"validation\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": \"s3://{}/{}/val/\".format(bucket, prefix),\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"csv\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created. Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "sm = boto3.client('sagemaker')\n",
    "\n",
    "sm.create_training_job(**create_training_params)\n",
    "\n",
    "status = sm.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']\n",
    "print(status)\n",
    "sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)\n",
    "if status == 'Failed':\n",
    "    message = sm.describe_training_job(TrainingJobName=job_name)['FailureReason']\n",
    "    print('Training failed with the following error: {}'.format(message))\n",
    "    raise Exception('Training job failed')  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can read the training and evluation metrics from AWS cloudwatch.\n",
    "train-auc: 0.916177 and \n",
    "validation-auc:0.906567.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Hosting \n",
    "Now that we've trained the `xgboost` algorithm on our data, let's setup a model which can later be hosted.  We will:\n",
    "1. Point to the scoring container\n",
    "1. Point to the model.tar.gz that came from training\n",
    "1. Create the hosting model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model_name=job_name + '-mdl'\n",
    "xgboost_hosting_container = {\n",
    "    'Image': container,\n",
    "    'ModelDataUrl': sm.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts'],\n",
    "    'Environment': {'this': 'is'}\n",
    "}\n",
    "\n",
    "create_model_response = sm.create_model(\n",
    "    ModelName=model_name,\n",
    "    ExecutionRoleArn=role,\n",
    "    PrimaryContainer=xgboost_hosting_container)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(create_model_response['ModelArn'])\n",
    "print(sm.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we've setup a model, we can configure what our hosting endpoints should be.  Here we specify:\n",
    "1. EC2 instance type to use for hosting\n",
    "1. Initial number of instances\n",
    "1. Our hosting model name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from time import gmtime, strftime\n",
    "\n",
    "endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(endpoint_config_name)\n",
    "create_endpoint_config_response = sm.create_endpoint_config(\n",
    "    EndpointConfigName = endpoint_config_name,\n",
    "    ProductionVariants=[{\n",
    "        'InstanceType':'ml.m4.xlarge',\n",
    "        'InitialInstanceCount':1,\n",
    "        'InitialVariantWeight':1,\n",
    "        'ModelName':model_name,\n",
    "        'VariantName':'AllTraffic'}])\n",
    "\n",
    "print(\"Endpoint Config Arn: \" + create_endpoint_config_response['EndpointConfigArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create endpoint\n",
    "Lastly, the customer creates the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "import time\n",
    "\n",
    "endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(endpoint_name)\n",
    "create_endpoint_response = sm.create_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    EndpointConfigName=endpoint_config_name)\n",
    "print(create_endpoint_response['EndpointArn'])\n",
    "\n",
    "resp = sm.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = resp['EndpointStatus']\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "while status=='Creating':\n",
    "    time.sleep(60)\n",
    "    resp = sm.describe_endpoint(EndpointName=endpoint_name)\n",
    "    status = resp['EndpointStatus']\n",
    "    print(\"Status: \" + status)\n",
    "\n",
    "print(\"Arn: \" + resp['EndpointArn'])\n",
    "print(\"Status: \" + status)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Evaluation\n",
    "There are many ways to compare the performance of a machine learning model. In this example, we will generate predictions and compare the ranking metric AUC (Area Under the ROC Curve)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "runtime= boto3.client('runtime.sagemaker')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Simple function to create a csv from our numpy array\n",
    "\n",
    "def np2csv(arr):\n",
    "    csv = io.BytesIO()\n",
    "    np.savetxt(csv, arr, delimiter=',', fmt='%g')\n",
    "    return csv.getvalue().decode().rstrip()\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to generate prediction through sample data\n",
    "def do_predict(data, endpoint_name, content_type):\n",
    "    \n",
    "    payload = np2csv(data)\n",
    "    response = runtime.invoke_endpoint(EndpointName=endpoint_name, \n",
    "                                   ContentType=content_type, \n",
    "                                   Body=payload)\n",
    "    result = response['Body'].read()\n",
    "    result = result.decode(\"utf-8\")\n",
    "    result = result.split(',')\n",
    "    preds = [float((num)) for num in result]\n",
    "    return preds\n",
    "\n",
    "# Function to iterate through a larger data set and generate batch predictions\n",
    "def batch_predict(data, batch_size, endpoint_name, content_type):\n",
    "    items = len(data)\n",
    "    arrs = []\n",
    "    \n",
    "    for offset in range(0, items, batch_size):\n",
    "        if offset+batch_size < items:\n",
    "            datav = data.iloc[offset:(offset+batch_size),:].as_matrix()\n",
    "            results = do_predict(datav, endpoint_name, content_type)\n",
    "            arrs.extend(results)\n",
    "        else:\n",
    "            datav = data.iloc[offset:items,:].as_matrix()\n",
    "            arrs.extend(do_predict(datav, endpoint_name, content_type))\n",
    "        sys.stdout.write('.')\n",
    "    return(arrs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### read the saved data for scoring\n",
    "data_train = pd.read_csv(\"formatted_train.csv\", sep=',', header=None) \n",
    "data_test = pd.read_csv(\"formatted_test.csv\", sep=',', header=None) \n",
    "data_val = pd.read_csv(\"formatted_val.csv\", sep=',', header=None) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate predictions on train, validation and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "\n",
    "preds_train_xgb = batch_predict(data_train.iloc[:, 1:], 1000, endpoint_name, 'text/csv')\n",
    "preds_val_xgb = batch_predict(data_val.iloc[:, 1:], 1000, endpoint_name, 'text/csv')\n",
    "preds_test_xgb = batch_predict(data_test.iloc[:, 1:], 1000, endpoint_name, 'text/csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute performance metrics on the training,validation, test data sets\n",
    "### compute auc/ginni "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score\n",
    "train_labels = data_train.iloc[:,0];\n",
    "val_labels = data_val.iloc[:,0];\n",
    "test_labels = data_test.iloc[:,0];\n",
    "\n",
    "print(\"Training AUC\", roc_auc_score(train_labels, preds_train_xgb)) ##0.9161\n",
    "print(\"Validation AUC\", roc_auc_score(val_labels, preds_val_xgb) )###0.9065\n",
    "print(\"Test AUC\", roc_auc_score(test_labels, preds_test_xgb) )###0.9112\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Linear-Model\n",
    "### Train a second model using SageMaker's Linear Learner"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "prefix = 'sagemaker/DEMO-linear'  ##subfolder inside the data bucket to be used for Linear Learner\n",
    "\n",
    "data_train = pd.read_csv(\"formatted_train.csv\", sep=',', header=None) \n",
    "data_test = pd.read_csv(\"formatted_test.csv\", sep=',', header=None) \n",
    "data_val = pd.read_csv(\"formatted_val.csv\", sep=',', header=None) \n",
    "\n",
    "train_y = data_train.iloc[:,0].as_matrix();\n",
    "train_X = data_train.iloc[:,1:].as_matrix();\n",
    "\n",
    "val_y = data_val.iloc[:,0].as_matrix();\n",
    "val_X = data_val.iloc[:,1:].as_matrix();\n",
    "\n",
    "test_y = data_test.iloc[:,0].as_matrix();\n",
    "test_X = data_test.iloc[:,1:].as_matrix();\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we'll convert the datasets to the recordIO wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3.  We'll start with training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert to protobuf format and upload the training and validation data to s3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_file = 'linear_train.data'\n",
    "\n",
    "f = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))\n",
    "f.seek(0)\n",
    "\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "validation_file = 'linear_validation.data'\n",
    "\n",
    "f = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(f, val_X.astype('float32'), val_y.astype('float32'))\n",
    "f.seek(0)\n",
    "\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', train_file)).upload_fileobj(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Training Algorithm Specifications\n",
    "\n",
    "Now we can begin to specify our linear model.  Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:\n",
    "\n",
    "- `num_models` to increase to total number of models run.  The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a solution nearby that may be more optimal.  In this case, we're going to use the max of 32.\n",
    "- `loss` which controls how we penalize mistakes in our model estimates.  For this case, let's use logistic loss as we are interested in estimating probabilities.\n",
    "- `wd` or `l1` which control regularization.  Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability.  In this case, we'll leave these parameters as their default \"auto\" though."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify images used for training and hosting SageMaker's linear-learner"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'linear-learner')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "linear_job = 'DEMO-linear-' + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n",
    "\n",
    "print(\"Job name is:\", linear_job)\n",
    "\n",
    "linear_training_params = {\n",
    "    \"RoleArn\": role,\n",
    "    \"TrainingJobName\": linear_job,\n",
    "    \"AlgorithmSpecification\": {\n",
    "        \"TrainingImage\": container,\n",
    "        \"TrainingInputMode\": \"File\"\n",
    "    },\n",
    "    \"ResourceConfig\": {\n",
    "        \"InstanceCount\": 1,\n",
    "        \"InstanceType\": \"ml.c4.2xlarge\",\n",
    "        \"VolumeSizeInGB\": 10\n",
    "    },\n",
    "    \"InputDataConfig\": [\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": \"s3://{}/{}/train/\".format(bucket, prefix),\n",
    "                    \"S3DataDistributionType\": \"ShardedByS3Key\"\n",
    "                }\n",
    "            },\n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\"\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"validation\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": \"s3://{}/{}/validation/\".format(bucket, prefix),\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\"\n",
    "        }\n",
    "\n",
    "    ],\n",
    "    \"OutputDataConfig\": {\n",
    "        \"S3OutputPath\": \"s3://{}/{}/\".format(bucket, prefix)\n",
    "    },\n",
    "    \"HyperParameters\": {\n",
    "        \"feature_dim\": \"100\",\n",
    "        \"mini_batch_size\": \"100\",\n",
    "        \"predictor_type\": \"binary_classifier\",\n",
    "        \"epochs\": \"10\",\n",
    "        \"num_models\": \"32\",\n",
    "        \"loss\": \"logistic\"\n",
    "    },\n",
    "    \"StoppingCondition\": {\n",
    "        \"MaxRuntimeInSeconds\": 60 * 60\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(linear_job)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created.  Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "sm = boto3.client('sagemaker')\n",
    "\n",
    "sm.create_training_job(**linear_training_params)\n",
    "status = sm.describe_training_job(TrainingJobName=linear_job)['TrainingJobStatus']\n",
    "print(status)\n",
    "sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=linear_job)\n",
    "if status == 'Failed':\n",
    "    message = sm.describe_training_job(TrainingJobName=linear_job)['FailureReason']\n",
    "    print('Training failed with the following error: {}'.format(message))\n",
    "    raise Exception('Training job failed')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Hosting:Linear-Learner\n",
    "\n",
    "Now that we've trained the linear algorithm on our data, let's setup a model which can later be hosted.  We will:\n",
    "1. Point to the scoring container\n",
    "1. Point to the model.tar.gz that came from training\n",
    "1. Create the hosting model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "\n",
    "linear_hosting_container = {\n",
    "   'Image': container,\n",
    "   'ModelDataUrl': sm.describe_training_job(TrainingJobName=linear_job)['ModelArtifacts']['S3ModelArtifacts']\n",
    "}\n",
    "\n",
    "#174872318107.dkr.ecr.us-west-2.amazonaws.com\n",
    "\n",
    "create_model_response = sm.create_model(\n",
    "    ModelName=linear_job,\n",
    "    ExecutionRoleArn=role,\n",
    "    PrimaryContainer=linear_hosting_container)\n",
    "\n",
    "print(create_model_response['ModelArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we've setup a model, we can configure what our hosting endpoints should be.  Here we specify:\n",
    "1. EC2 instance type to use for hosting\n",
    "1. Initial number of instances\n",
    "1. Our hosting model name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "linear_endpoint_config = 'DEMO-linear-endpoint-config-' + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n",
    "print(linear_endpoint_config)\n",
    "create_endpoint_config_response = sm.create_endpoint_config(\n",
    "    EndpointConfigName=linear_endpoint_config,\n",
    "    ProductionVariants=[{\n",
    "        'InstanceType': 'ml.m4.xlarge',\n",
    "        'InitialInstanceCount': 1,\n",
    "        'ModelName': linear_job,\n",
    "        'VariantName': 'AllTraffic'}])\n",
    "\n",
    "print(\"Endpoint Config Arn: \" + create_endpoint_config_response['EndpointConfigArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've specified how our endpoint should be configured, we can create them.  This can be done in the background, but for now let's run a loop that updates us on the status of the endpoints so that we know when they are ready for use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "linear_endpoint = 'DEMO-linear-endpoint-' + time.strftime(\"%Y%m%d%H%M\", time.gmtime())\n",
    "print(linear_endpoint)\n",
    "create_endpoint_response = sm.create_endpoint(\n",
    "    EndpointName=linear_endpoint,\n",
    "    EndpointConfigName=linear_endpoint_config)\n",
    "print(create_endpoint_response['EndpointArn'])\n",
    "\n",
    "resp = sm.describe_endpoint(EndpointName=linear_endpoint)\n",
    "status = resp['EndpointStatus']\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "sm.get_waiter('endpoint_in_service').wait(EndpointName=linear_endpoint)\n",
    "\n",
    "resp = sm.describe_endpoint(EndpointName=linear_endpoint)\n",
    "status = resp['EndpointStatus']\n",
    "print(\"Arn: \" + resp['EndpointArn'])\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "if status != 'InService':\n",
    "    raise Exception('Endpoint creation did not succeed')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction:Linear-Learner\n",
    "#### Predict using SageMaker's Linear Learner and evaluate the performance\n",
    "\n",
    "Now that we have our hosted endpoint, we can generate statistical predictions from it.  Let's predict on our test dataset to understand how accurate our model is on unseen samples using AUC metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def np2csv(arr):\n",
    "    csv = io.BytesIO()\n",
    "    np.savetxt(csv, arr, delimiter=',', fmt='%g')\n",
    "    return csv.getvalue().decode().rstrip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to generate prediction through sample data\n",
    "def do_predict_linear(data, endpoint_name, content_type):\n",
    "    \n",
    "    payload = np2csv(data)\n",
    "    response = runtime.invoke_endpoint(EndpointName=endpoint_name, \n",
    "                                   ContentType=content_type, \n",
    "                                   Body=payload)\n",
    "    result = json.loads(response['Body'].read().decode())\n",
    "    preds =  [r['score'] for r in result['predictions']]\n",
    "\n",
    "    return preds\n",
    "\n",
    "# Function to iterate through a larger data set and generate batch predictions\n",
    "def batch_predict_linear(data, batch_size, endpoint_name, content_type):\n",
    "    items = len(data)\n",
    "    arrs = []\n",
    "    \n",
    "    for offset in range(0, items, batch_size):\n",
    "        if offset+batch_size < items:\n",
    "            datav = data.iloc[offset:(offset+batch_size),:].as_matrix()\n",
    "            results = do_predict_linear(datav, endpoint_name, content_type)\n",
    "            arrs.extend(results)\n",
    "        else:\n",
    "            datav = data.iloc[offset:items,:].as_matrix()\n",
    "            arrs.extend(do_predict_linear(datav, endpoint_name, content_type))\n",
    "        sys.stdout.write('.')\n",
    "    return(arrs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### Predict on Training Data\n",
    "preds_train_lin = batch_predict_linear(data_train.iloc[:,1:], 100, linear_endpoint , 'text/csv')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### Predict on Validation Data\n",
    "preds_val_lin = batch_predict_linear(data_val.iloc[:,1:], 100, linear_endpoint , 'text/csv')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### Predict on Test Data\n",
    "preds_test_lin = batch_predict_linear(data_test.iloc[:,1:], 100, linear_endpoint , 'text/csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation of Linear Learner"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Training AUC\", roc_auc_score(train_labels, preds_train_lin)) ##0.9091\n",
    "print(\"Validation AUC\", roc_auc_score(val_labels, preds_val_lin) )###0.8998\n",
    "print(\"Test AUC\", roc_auc_score(test_labels, preds_test_lin) )###0.9033\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ensemble\n",
    "### Perform simple average of the two models and evaluate on training, validaion and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ens_train = 0.5*np.array(preds_train_xgb) + 0.5*np.array(preds_train_lin);\n",
    "ens_val = 0.5*np.array(preds_val_xgb) + 0.5*np.array(preds_val_lin);\n",
    "ens_test = 0.5*np.array(preds_test_xgb) + 0.5*np.array(preds_test_lin);\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate-Ensemble\n",
    "### Evaluate the combined ensemble model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Print AUC of the combined model\n",
    "print(\"Train AUC- Xgboost\", round(roc_auc_score(train_labels, preds_train_xgb),5))\n",
    "print(\"Train AUC- Linear\", round(roc_auc_score(train_labels, preds_train_lin),5))\n",
    "print(\"Train AUC- Ensemble\", round(roc_auc_score(train_labels, ens_train),5))\n",
    "\n",
    "print(\"=======================================\")\n",
    "print(\"Validation AUC- Xgboost\", round(roc_auc_score(val_labels, preds_val_xgb),5))\n",
    "print(\"Validation AUC- Linear\", round(roc_auc_score(val_labels, preds_val_lin),5))\n",
    "print(\"Validation AUC- Ensemble\", round(roc_auc_score(val_labels, ens_val),5))\n",
    "\n",
    "print(\"======================================\")\n",
    "print(\"Test AUC- Xgboost\", round(roc_auc_score(test_labels, preds_test_xgb),5)) \n",
    "print(\"Test AUC- Linear\", round(roc_auc_score(test_labels, preds_test_lin),5)) \n",
    "print(\"Test AUC- Ensemble\", round(roc_auc_score(test_labels, ens_test),5))   \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### We observe that by doing a simple average of the two predictions we get improved AUC compared either of the two models on all training, validation and test data sets.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save Final prediction on test-data "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "final = pd.concat([data_test.iloc[:,0], pd.DataFrame(ens_test)], axis=1)\n",
    "final.to_csv(\"Xgboost-linear-ensemble-prediction.csv\", sep=',', header=False, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Run below to delete endpoints once you are done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sm.delete_endpoint(EndpointName=endpoint_name)\n",
    "sm.delete_endpoint(EndpointName=linear_endpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Extensions\n",
    "\n",
    "This example analyzed a relatively small dataset, but utilized SageMaker features such as,\n",
    "* managed single-machine training of XGboost model \n",
    "* managed training of Linear Learner\n",
    "* highly available, real-time model hosting, \n",
    "* doing a batch prediction using the hosted model\n",
    "* Doing an ensemble of Xgboost and Linear Learner\n",
    "\n",
    "This example can be extended in several ways using SageMaker features such as,\n",
    "* Distributed training of Xgboost/Linear model\n",
    "* Picking a different model for training\n",
    "* Training a separate model for peforming the ensemble instead of a taking a simple average.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_python3)",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the License). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the license file accompanying this file. This file is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
